Cleaning The Data

It’s very rare that a dataset comes in the exact format that you want it. There might be unnecessary/distracting columns (as you’ll soon see with the Texas Lege data), empty cells, confusing column names, etc. This is why it’s always good practice to “clean” the data and get it in proper shape before you start analyzing it. How do you clean it? With code! You can use simple lines of code to rename columns, remove or select columns, merge data from separate tables together, and more.

Before we get into that, let’s open the README file to learn about the variables included in each dataset.

Read about each CSV

Click on the README.md file on the right side to open the document. It should appear in the top left window. If you want to collapse the console/terminal for a better view, click on the small box in the top right corner of the console window.

Now you can scroll through the README and learn more about each dataset, including the column names and a description of what the data in that column is. The README comes directly from the developers at LegiScan; this is a nice explanatory document that many databases will provide with raw data, but it isn’t always included.

As I said in the last section, we’re only going to be working with the bills, people, and sponsors datasets, so definitely take a look at those. If you want to close the document, click on the small “x” next to the file name (right above the toolbar). But you can also keep it open as we work in our cleaning file.

Setup

Before we start coding, we need to install and load the packages that are going to help facilitate the data analysis. You don’t need to worry about what each package does, but this is an important setup step that allows us to use specific functions later on. I always include tidyverse and janitor at the top of every R file.

If you want to learn more about what each of those packages does, you can read about them here and here. But again, not super necessary at the moment.

Installing the packages

Expand your console at the bottom of the page, type the following command, and hit enter:

install.packages("tidyverse")

It’s going to immediately start spewing out a bunch of scary red text – don’t worry, that actually means it’s working. It’s installing all of the tidyverse packages (and there’s quite a few) to the project so you can run them later on. It may take a minute or two to finish. You’ll know it’s done if you see the message “The downloaded source packages are in” and this symbol reappears: >

Next, do the same thing for janitor.

install.packages("janitor")

This one shouldn’t take as long.

Load them into the cleaning file

Great! Click your cleaning.qmd file to open it up. At the top, write a little description to remind your future self what this file was for, something like “In this file I cleaned the bills, people, and sponsors data from LegiScan and merged the three together.”

Next, insert a code chunk (see tip below), and add a “Setup” label within the R bracket. Adding a title within the bracket allows you to name the code chunk, which helps you keep track of what’s going on in each block of code. Another good organizational practice to get in the habit of.

Tip

Add a “code chunk” to your file by hitting Command+Option+i on the keyboard. You’ll run all of your code inside these little blocks, so get used to this shortcut!

Copy and paste the code below and “run” it by hitting the green arrow in the top right of your code chunk. You’ll see some sort of message pop up, but as long as it doesn’t say “Error” you’re fine to move on.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Tip

To run code without clicking on the green arrow, hit Command + Enter on the keyboard. Another good one to get used to.

And that’s all you need to worry about for setup!

Important

posit.cloud does not auto save!! So make sure to hit Command + S frequently as you work on this.

Load the raw data

Next, you need to tell R that you want it to read the CSV files from your data-raw folder so they appear below. The function to do this is aptly named read_csv().

Create a new code chunk and label it “import bills” inside the R bracket. Run the code below to import the bills data into this file. No data should pop up yet, but by running this code, you’ve created an object called “bills” and have told R that you want to read the csv and save that data inside your new object.

Note

By putting “data-raw/bills.csv” inside the function, you’re basically telling R: inside the data-raw folder, read the file bills.csv

bills <- read_csv("data-raw/bills.csv")
Rows: 9897 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): bill_number, status_desc, title, description, committee, last_acti...
dbl  (4): bill_id, session_id, status, committee_id
date (2): status_date, last_action_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now, either in the same code chunk or a new one, type bills and run it. Now you should see a table with 14 columns and thousands of rows of data! Feel free to click through it.

bills 

Repeat these two steps for the “sponsors” data and “people” data. Remember to name your code chunks!

sponsors <- read_csv("data-raw/sponsors.csv")
Rows: 14190 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): bill_id, people_id, position

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sponsors 
lawmakers <- read_csv("data-raw/people.csv")
Rows: 181 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, first_name, middle_name, last_name, suffix, nickname, party,...
dbl  (7): people_id, party_id, role_id, followthemoney_eid, votesmart_id, kn...
lgl  (1): opensecrets_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
lawmakers

Clean the tables

You probably noticed while clicking through that these tables aren’t as clean as they could be. Some of the column names don’t make sense, there are a lot of “NA” cells, and some of the information is just unnecessary for what we want to look at in this project. Let’s go through each table one at a time and clean them up.

1. Bills

If you go back and look at the bills table, you can see it includes a lot of different columns, some of which are just links. We’re going to use the select() function to remove those columns. We can also take out the session_id column because we know all of this data is from the 89th legislative session, so that value is going to be the same for every entry.

Create a new code block, name it, and then copy and run the code below. By putting a minus sign in front of the column name, you’re telling R: don’t include this column in the new table.

bills_clean <- bills |> 
   select(-session_id, -url, -state_link) 
bills_clean
Note

Notice how I saved this in a new object called “bills_clean.” This means the untouched data still exists in the object “bills” if you ever wanted to run that and view it again. It’s good practice to create new objects as you change the data and rename it to something that makes sense.

Now you can click through the table to see that the url, state link, and session id data is gone.

2. Lawmakers

Looking at the people data, which we’ve saved as “lawmakers,” you can see there are a lot of specific ids for other open data portals, like Follow the Money, Open Secrets, Ballotpedia, and KnowWho. While these are cool to have, we don’t need them for the analysis we’re running today. If you were writing a story about legislative campaign funds in Texas, for example, you might chose to keep the Follow the Money column.

To keep things simple (and to make our table a little smaller) let’s remove those columns. We’re still going to use the select() function by in the opposite way, this time telling R which columns you do want to include.

lawmakers_clean <- lawmakers |> 
  select(people_id, name, middle_name, suffix, party, role, district) 
lawmakers_clean

You could do this the same way we did for bills, which was use select(-) and tell R the names of the columns you want to remove. It doesn’t really matter, I just wanted to demonstrate both options.

3. Sponsors

Sponsors is a much smaller data table, with only three columns. As noted in the README, “position” refers to where a lawmaker appears on the sponsor list for a specific bill. For example, if a lawmaker is the primary sponsor for a bill, their position number would read “1” … if they were the second person to sponsor that bill, it would read “2” … etc.

All three of these columns will be useful to us, so we’re not going to remove anything.

However, this is a good chance to try the rename() function, which allows you the change the name of a column. Personally, I want to remind myself that the position number is associated with sponsor order, so I want to rename the column to “spons_position”

sponsors_clean <- sponsors |> 
  rename(spons_position = position)
sponsors_clean

Now you can see the new column name when you run the table.

Tip

If you want to choose a name that makes more sense to you, use this convention in the code: rename([new column name] = [original column name])

Merge the tables together

Now that we’ve cleaned up each table, we’re going to combine everything together into one big table to use for our analysis. We’ll do this using the merge() function.

Since we’re combining three different tables, we’ll do this in two steps. First, copy and run the code below (in a new code chunk) to merge the lawmaker and sponsor tables together. I’ve created a new object called “law-spons”

law_spons <- merge(lawmakers_clean, sponsors_clean) 
law_spons

Great! That’s most of the data we need, but we’re still missing some information from the bills data. Run the code below in a new chunk to add it.

Note

I’ve also renamed the middle_name column to say mid_init because that’s more accurate, and I’m picky.

final_table <- law_spons |> 
  left_join(bills_clean) |> 
  rename(mid_init = middle_name) 
Joining with `by = join_by(bill_id)`
final_table

Amazing! You should have one big data table with the following information:

  • people_id = the unique number associated with each lawmaker

  • name = lawmaker’s first and last name

  • mid_init = lawmaker’s middle initial

  • suffix = lawmaker’s suffix, if applicable

  • party = political party of each lawmaker

  • role = whether the lawmaker is a senator or a representative

  • district = the congressional district associated with each lawmaker

  • bill_id = the unique number associated with a specific bill filed this session

  • bill_number = standard bill naming in the legislature

  • spons_position = for a specific bill, this number indicates where a lawmaker is in the line of sponsor (ex: “1”=primary sponsor, “7”=seventh sponsor, etc.)

  • status = a numerical representation of where the bill currently sits (ex: “0”=no progress, “1”=Introduced, “2”=Engrossed, “4”=Passed) There is no 3, I don’t know why.

  • status_desc = written description of the status

  • status_date = date a specific bill achieved the corresponding status

  • title = shorter description of bill as written in LegiScan

  • description = longer description of bill as written in LegiScan

  • committee = which Senate/House committee the bill was referred to

  • committee_id = unique code associated with each committee

  • last_action = most recent update to bill as of the date you downloaded the data

  • last_action_date = calendar date of that update

Phew! This is a lot of information to work with. We may not end up using every column, but it’s still important to have a full picture of the data available. The table has over 14,000 rows of data, each corresponding with a bill/joint resolution. Now you can see the appeal of sorting through everything in R, rather than manually or in a spreadsheet.

Note

There aren’t acutally 14,000 unique bills in session because some of those are “companion bills” which are the same exact bill, but one is introduced in the House and the other is introduced in the Senate.

We’ll be able to quickly sort through all of this information and identify key details about the agendas of Texas lawmakers in the current legislative session. Let’s save this table before we move forward with our analysis.

Save the clean table

Add instructions for creating a data-processed file.

final_table |> 
  write_rds("data-processed/bills-89.rds")

In the next chapter of this toolkit, we’ll brain storm some potential questions about the session and use the data to find some answers.